Abstract

Stimuli outside classical receptive fields have been shown to exert significant influence over 
the activities of neurons in primary visual cortex. We propose that contextual influences 
are used for pre-attentive visual segmentation, in a new framework called segmentation 
without classification. This means that segmentation of an image into regions occurs with- 
out classification of features within a region or comparison of features between regions. 
This segmentation framework is simpler than previous computational approaches, making 
it implementable by V1 mechanisms, though higher level visual mechanisms are needed
to refine its output. However, it easily handles a class of segmentation problems that 
are tricky in conventional methods. The cortex computes global region boundaries by de- 
tecting the breakdown of homogeneity or translation invariance in the input, using local 
intra-cortical interactions mediated by the horizontal connections. The difference between 
contextual influences near and far from region boundaries makes neural activities near 
region boundaries higher than elsewhere, making boundaries more salient for perceptual 
pop-out. This proposal is implemented in a biologically based model of V1, and demonstrated
using examples of texture segmentation and figure-ground segregation. The model
performs segmentation in exactly the same neural circuit that solves the dual problem of 
the enhancement of contours, as is suggested by experimental observations. Its behavior is 
compared with psychophysical and physiological data on segmentation, contour enhance- 
ment, and contextual influences. We discuss the implications of segmentation without 
classification and the predictions of our V1 model, and relate it to other phenomena such
as asymmetry in visual search. 
1. Introduction

In early stages of the visual system, individ- 
ual neurons respond directly only to stimuli in 
their classical receptive fields (RFs) (Hubel and
Wiesel, 1962). These RFs sample the local con- 
trast information in the input but are too small 
to cover visual objects at a global scale. Recent
experiments show that the responses of primary
cortical (V1) cells are significantly influenced
by stimuli nearby and beyond their classical
RFs (Allman, Miezin, and McGuinness,
1985, Knierim and Van Essen 1992, Gilbert, 
1992, Kapadia, Ito, Gilbert, and Westheimer 
1995, Sillito et al 1995, Lamme, 1995, Zipser, 
Lamme, and Schiller 1996, Levitt and Lund 
1997). These contextual influences are in gen- 
eral suppressive and depend on whether stimuli 
within and beyond the RFs share the same ori- 
entation (Allman et al, 1985, Knierim and Van 
Essen 1992, Sillito et al 1995, Levitt and Lund 
1997). In particular, the response to an opti- 
mal bar in the RF is suppressed significantly by 
similarly oriented bars in the surround — iso- 
orientation suppression (Knierim and Van Es- 
sen 1992). The suppression is reduced when the 
orientations of the surround bars are random or 
different from the bar in the RF (Knierim and 
Van Essen 1992, Sillito et al 1995). However, if 
the surround bars are aligned with the optimal 
bar inside the RF to form a smooth contour, 
then suppression becomes facilitation (Kapadia 
et al 1995). The contextual influences are apparent
within 10-20 ms after the cell's initial response
(Knierim and Van Essen 1992, Kapadia
et al 1995), suggesting that mechanisms within
V1 itself are responsible (see discussion later on
the different time scales observed by Zipser et 
al 1996). Horizontal intra-cortical connections 
linking cells with non-overlapping RFs and sim- 
ilar orientation preferences have been observed 
and hypothesized as the underlying neural substrate
(Gilbert and Wiesel, 1983, Rockland and
Lund 1983, Gilbert, 1992). While the phenom- 
ena and the mechanisms of the contextual in- 
fluences are studied experimentally and in some 
models (e.g., Somers, Todorov, Siapas, and Sur
1995, Stemmler, Usher, and Niebur, 1995), insights
into their computational roles have been
limited to mainly contour or feature linking (Allman
et al 1985, Gilbert, 1992, see more references
in Li 1998).

We propose that contextual influences serve 
the goal of pre-attentive visual segmentation or 
grouping to infer global visual objects such as re- 
gions and contours from local RF features. Lo- 
cal features can group into regions, as in texture 
segmentation; or into contours which may repre- 
sent boundaries of underlying objects. We show 
how region segmentation emerges from a simple
but biologically based model of V1 with only
local cortical interactions between cells within a 
few RF sizes away from each other. Note that although
the horizontal intra-cortical connections
are termed long range, they are still local with
respect to the whole visual field since the ax- 
ons reach only a few millimeters, or a few hy- 
percolumns or receptive field sizes, away from 
the pre-synaptic cells. To attack the formidable 
problem of segmentation in such a low level vi- 
sual area, we introduce a new computational 
framework — segmentation without classifica- 
tion. 

2. The problem of visual segmentation 

Visual segmentation is defined as locating the 
boundary between different image regions. For 
example, when regions are defined by their pixel 
luminance values, center-surround filters in the 
retina can locate boundaries between regions by 
comparing the classification (in this case, lumi- 
nance) values between neighboring image areas. 
In general, regions are seldom classifiable by 
pixel luminances, and image filters serve mainly
to extract image features rather than to segment
image regions (Haralick and Shapiro, 1992). For 
general region segmentation, previous compu- 
tational approaches have always assumed, im- 
plicitly or explicitly, that segmentation requires 
(1) feature extraction or classification for ev- 
ery small image area, and, (2) comparisons of 
the classification flags (feature values) between 
neighboring image areas to locate the boundary 
as where the classification flags change (Haralick
and Shapiro, 1992, Bergen 1991, Bergen and
Adelson, 1988, Malik and Perona 1990). This 
framework can be summarized as segmentation 
by classification. Over the years, many such seg- 
mentation algorithms have been developed both 
for computer vision (Haralick and Shapiro, 1992) 
and to model natural vision (Bergen and Adel- 
son, 1988, Malik and Perona 1990). They are 
all forms of segmentation by classification, and 
differ chiefly as to how the region features are ex- 
tracted and classified, whether it is by, e.g., im- 
age statistics by pixel correlations, or the model 
parameters in the Markov random fields gener- 
ating the image (Haralick and Shapiro, 1992), or 
the outcomes from model neural filters (Bergen
and Adelson, 1988) or model neural interactions 
(Malik and Perona 1990). In such approaches, 
classification is problematic and ambiguous near 
boundaries between regions. This is because fea- 
ture evaluations can only be performed in lo- 
cal areas of the image which are assumed to be 
sitting away from region boundaries, i.e., fea- 
ture classification presumes some degree of re- 
gion segmentation. A priori, the locations of 
the region boundaries are not known, and so 
the feature values at these places will be am- 
biguous. The probability that this ambiguity 
happens can be reduced by choosing smaller im- 
age areas for feature evaluation. However, this 
in turn gives less accurate feature values, espe- 
cially in textured regions, and can lead to there 
being significant differences in the feature val- 
ues even within a region, and thus to false re- 
gion boundaries. One seems inevitably to face a 
fundamental dilemma — classification presumes 
segmentation, and segmentation presumes clas- 
sification. 
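
The two-step scheme described above, and the ambiguity it runs into near a boundary, can be illustrated with a toy one-dimensional example. The sketch below is not taken from any of the cited algorithms: the window-mean feature, the function names, and the step-shaped input are all assumptions chosen purely for illustration.

```python
def local_feature(signal, center, half_width):
    """Step 1: 'classify' a small image area by its mean value (a stand-in feature)."""
    lo = max(0, center - half_width)
    hi = min(len(signal), center + half_width + 1)
    window = signal[lo:hi]
    return sum(window) / len(window)


def segment_by_classification(signal, half_width):
    """Step 2: compare features of neighboring areas; report the largest change."""
    features = [local_feature(signal, c, half_width) for c in range(len(signal))]
    changes = [abs(features[c + 1] - features[c]) for c in range(len(features) - 1)]
    boundary = max(range(len(changes)), key=changes.__getitem__)
    return features, boundary


# Two regions with feature value 0 (columns 0-9) and 1 (columns 10-19).
signal = [0.0] * 10 + [1.0] * 10
features, boundary = segment_by_classification(signal, half_width=2)
```

Windows that straddle the boundary (columns 8 to 11 here) return intermediate feature values that belong to neither region, so the feature change is spread over several columns instead of being localized at the true border.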

This dilemma can be dissolved by recognizing 
that segmentation does not presume classifica- 
tion. Natural vision can segment the two re- 
gions in Fig. 1 even though they have the same 
texture features (note that the plotted area is 
only a small part of an extended image). In 
this case, classification of the region features is 
neither sufficient, nor necessary, and segmentation
is rather triggered by the sudden changes
near the region boundary which is problematic
in traditional approaches. In fact, even with distinguishable
classification flags for all image areas
in any two regions (such as the '|' and '-'
in Fig. 3A), segmentation is not completed un- 
til another processing step locates the boundary, 
perhaps by searching for where the classification 
flags change. We propose that segmentation at 
its pre-attentive bare minimum is segmentation 
without classification, i.e., segmentation with- 
out explicitly knowing the feature contents of 
the regions. This simplifies the segmentation 
process conceptually, making it plausible that 
it can be performed by low level processing in
V1. This paper focuses on pre-attentive segmentation.
Additional processing is likely needed
to improve the resulting segmentation, e.g., by 
refining the coarse boundaries detected at the 
pre-attentive stage and classifying the contents 
of the regions. 

3. The principle and its implementation 

The principle of segmentation without classifi- 
cation is to detect region boundaries by detect- 
ing the breakdown of translation invariance in 
inputs. A single image region is assumed to be 
defined by the homogeneity or translation in- 
variance of the statistics of the image features, 
no matter what the features are, or, for instance, 
whether they are colored red or blue or whether 
or not the texture elements are textons (Julesz, 
1981). In general, this translation invariance 
should include cases such as the image of a sur- 
face slanted in depth, although the current im- 
plementation of the principle has not yet been 
generalized beyond images of fronto-parallel sur- 
faces. Homogeneity is disrupted or broken at the 
boundary of a region. In segmentation without 
classification, a mechanism signals the location 
of this disruption without explicitly extracting 
and comparing the features in image areas. 

This principle is implemented in a model of 
V1. Without loss of generality, the model focuses
on texture segmentation, i.e., segmentation
without color, motion, luminance, or stereo
cues. To focus on the segmentation problem, the 
model includes mainly layer 2-3 orientation selective
cells and ignores the mechanism by which
their receptive fields are formed. Inputs to the 
model are images filtered by the edge- or bar-like
local RFs of V1 cells. (The terms 'edge'
and 'bar' will be used interchangeably.) The re- 
sulting bar inputs are merely image primitives, 
which are in principle like image pixel primi- 
tives and are reversibly convertible from them. 
They are not texture feature values, such as the 
'+' or 'x' patterns in Fig. 6D and the statistics 
of their spatial arrangements, or the estimated 
densities of bars of particular orientations, from 
which one cannot recover the original input images.
To reiterate, this model does not extract
texture features in order to segment (see footnote 1). To avoid
confusion, this paper uses the term 'edge' only
for local luminance contrast; a boundary of a region
is termed 'boundary' or 'border', which may
or may not (especially for texture regions) correspond
to any 'edges' in the image. The cells
influence each other contextually via horizontal 
intra-cortical connections (Rockland and Lund 
1983, Gilbert and Wiesel, 1983, Gilbert, 1992), 
transforming patterns of inputs to patterns of 
cell responses. If cortical interactions are trans- 
lation invariant and do not induce spontaneous 
pattern formation (such as zebra stripes (Meinhardt,
1982)) through the spontaneous breakdown
of translation symmetry, then the cortical
response to a homogeneous region will itself
be homogeneous. However, if there is a region 
boundary, then two neurons, one near and an- 
other far from the boundary will experience dif- 
ferent contextual influences, and thus respond 
differently. In the model, the cortical interac- 
tions are designed (see below) such that the ac- 
tivities of neurons near the boundaries will be 
relatively higher. This makes the boundaries 
relatively more salient, allowing them to pop 
out perceptually for pre-attentive segmentation. 



1 In practice, in the presence of noise, it is not possible
to uniquely reconstruct the original pixel values in the input
image from the 'edge' and 'bar' variables. For simplicity, the
current implementation has not enforced this reversibility.
However, the principle of no classification is adhered to by
not explicitly comparing (whether by differentiation or by
related means) the 'edge' and 'bar' values between image
areas to find region boundaries.



Experiments in V1 indeed show that activity
levels are robustly higher near simple texture 
boundaries only 10-15 msec after the initial cell 
responses (Nothdurft, 1994, Gallant, Van Essen, 
and Nothdurft 1995). 
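
The argument above can be made concrete with a deliberately crude, static caricature. The sketch below is not the dynamic model of this paper: it assumes a single row of bars, a fixed suppression weight per similarly oriented neighbor within a fixed radius, and a texture that continues unchanged beyond both ends of the row. All names and numerical values are hypothetical.

```python
def responses(orientations, input_strength=2.0, weight=0.2, radius=3):
    """Response of each bar after suppression by same-orientation neighbors.

    The same rule is applied at every position (translation invariance);
    no orientation values are compared between image areas to find a border.
    """
    n = len(orientations)

    def at(j):
        # Assumption: the texture extends beyond the plotted row.
        return orientations[min(max(j, 0), n - 1)]

    out = []
    for i in range(n):
        same = sum(
            1
            for d in range(-radius, radius + 1)
            if d != 0 and at(i + d) == orientations[i]
        )
        out.append(max(0.0, input_strength - weight * same))
    return out


# Vertical bars (90 degrees) on the left, horizontal bars (0 degrees) on the right.
row = [90] * 10 + [0] * 10
out = responses(row)
uniform = responses([90] * 20)
```

In this toy setting a homogeneous row yields a homogeneous output, while bars next to the border (columns 9 and 10) lose part of their iso-orientation suppression and therefore respond most strongly.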

Fig. 2 shows the elements of the model and 
their interactions. At each location i there is a 
model V1 hypercolumn composed of $K$ neuron
pairs. Each pair $(i, \theta)$ has RF center $i$ and preferred
orientation $\theta = k\pi/K$ for $k = 1, 2, ..., K$,
and is called (a neural representation of) an edge 
segment. Based on experimental data (White, 
1989, Douglas and Martin 1990), each edge seg- 
ment consists of an excitatory and an inhibitory 
neuron that are connected with each other, and 
each model cell represents a collection of local 
cells of similar types. The excitatory cell re- 
ceives the visual input; its output quantifies the 
response or salience of the edge segment and 
projects to higher visual areas. The inhibitory 
cells are treated as interneurons. An edge of
input strength $\hat{I}_{i\beta}$ at $i$ with orientation $\beta$ in
the input image contributes to $I_{i\theta}$ by an amount
$\hat{I}_{i\beta}\,\phi(\theta - \beta)$, where $\phi(\theta - \beta) = e^{-|\theta - \beta|/(\pi/8)}$ is
the cell's orientation tuning curve. Based on
observations by Gilbert, Lund and their colleagues
(Gilbert and Wiesel, 1983, Rockland
and Lund, 1983, Hirsch and Gilbert, 1991), horizontal
connections $J_{i\theta,j\theta'}$ (resp. $W_{i\theta,j\theta'}$) mediate
contextual influences via monosynaptic excitation
(resp. disynaptic inhibition) from bar $j\theta'$
to $i\theta$ which have nearby but different RF centers,
$i \neq j$, and similar orientation preferences,
$\theta \sim \theta'$. The membrane potentials follow the
equations:

$$\dot{x}_{i\theta} = -\alpha_x x_{i\theta} - \sum_{\Delta\theta} \psi(\Delta\theta)\, g_y(y_{i,\theta+\Delta\theta}) + J_o\, g_x(x_{i\theta}) + \sum_{j \neq i,\theta'} J_{i\theta,j\theta'}\, g_x(x_{j\theta'}) + I_{i\theta} + I_o \tag{1}$$

$$\dot{y}_{i\theta} = -\alpha_y y_{i\theta} + g_x(x_{i\theta}) + \sum_{j \neq i,\theta'} W_{i\theta,j\theta'}\, g_x(x_{j\theta'}) + I_c \tag{2}$$

where $\alpha_x x_{i\theta}$ and $\alpha_y y_{i\theta}$ model the decay to resting
potential, $g_x(x)$ and $g_y(y)$ are sigmoid-like
functions modeling cells' firing rates in response
to membrane potentials $x$ and $y$, respectively,
$\psi(\Delta\theta)$ is the spread of inhibition within a hypercolumn,
$J_o g_x(x_{i\theta})$ is self excitation, $I_c$ and $I_o$ are
background inputs, including noise and inputs
modeling the general and local normalization of
activities (Heeger, 1992) (see Li (1998) for more
details). Visual input $I_{i\theta}$ persists after onset,
and initializes the activity levels $g_x(x_{i\theta})$. Equations
(1) and (2) specify how the activities are
then modified (effectively within one membrane 
time constant) by the contextual influences. De- 
pending on the visual stimuli, the system of- 
ten settles into an oscillatory state (Gray and 
Singer, 1989, Eckhorn, Bauer, Jordan, Brosch, 
Kruse, Munk, and Reitboeck 1988), an intrin- 
sic property of a population of recurrently con- 
nected excitatory and inhibitory cells (see Li 
(1998) for detailed parameters and dynamic 
analysis of the model). Temporal averages of 
$g_x(x_{i\theta})$ over several oscillation cycles (about 12
to 24 membrane time constants) are used as the 
model's output. If the maxima over time of the 
responses of the cells were used instead as the 
model's output, the boundary effects shown in 
this paper would usually be stronger. That dif- 
ferent regions occupy different oscillation phases 
could be exploited for segmentation (Li, 1998), 
although we do not do so here. The nature of 
the computation performed by the model is de- 
termined largely by the horizontal connections 
J and W. 
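
To make the structure of equations (1) and (2) concrete, the following is a minimal forward-Euler sketch for a single edge segment. It rests on several assumptions that are not from the paper: the piecewise-linear forms of the firing-rate functions, all parameter values, a within-hypercolumn inhibition reduced to the single term with weight 1, and horizontal inputs supplied as plain numbers (`lateral_exc`, `lateral_inh`) instead of being computed from $J$ and $W$. The actual parameters are given in Li (1998).

```python
def g_x(x, threshold=1.0):
    """Excitatory firing rate: assumed piecewise-linear, clipped to [0, 1]."""
    return min(1.0, max(0.0, x - threshold))


def g_y(y):
    """Inhibitory firing rate: assumed rectified-linear."""
    return max(0.0, y)


def step(x, y, visual_input, dt=0.01, alpha_x=1.0, alpha_y=1.0, J_o=0.8,
         I_o=0.0, I_c=0.0, lateral_exc=0.0, lateral_inh=0.0):
    """One Euler step of equations (1) and (2) for one excitatory-inhibitory pair."""
    dx = (-alpha_x * x - g_y(y) + J_o * g_x(x)
          + lateral_exc + visual_input + I_o)
    dy = -alpha_y * y + g_x(x) + lateral_inh + I_c
    return x + dt * dx, y + dt * dy


def simulate(visual_input, steps=5000, dt=0.01, average_last=1200):
    """Integrate from rest and return the temporal average of g_x over the
    last `average_last` steps (12 membrane time constants when alpha_x = 1)."""
    x = y = 0.0
    trace = []
    for _ in range(steps):
        x, y = step(x, y, visual_input, dt=dt)
        trace.append(g_x(x))
    tail = trace[-average_last:]
    return sum(tail) / len(tail)


saliency = simulate(visual_input=2.0)
```

With these assumed parameters the isolated pair relaxes through a damped oscillation to a steady output; the sustained oscillations mentioned in the text depend on the actual parameters and recurrent connections, which this sketch does not reproduce.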

For view-point invariance, the connections are 
local, and translation and rotation invariant 
(Fig. 2B), i.e., every pyramidal cell has the 
same horizontal connection pattern in its ego- 
centric reference frame. The synaptic weights 
are designed for the segmentation task while 
staying consistent with experimental observa- 
tions (Rockland and Lund 1983, Gilbert and 
Wiesel, 1983, Hirsch and Gilbert 1991, Weliky, 
Kandler, Fitzpatrick, and Katz 1995). In partic- 
ular, J and W are chosen to satisfy the follow- 
ing three conditions (Li, 1997): (1) the system 
should not generate patterns spontaneously, i.e., 
homogeneous input images give homogeneous outputs,
so that no illusory borders occur within
a single region, (2) neurons at region borders
should give relatively higher responses, and (3)
the same neural circuit should perform contour 
enhancement. Condition (3) is not only required 
by physiological facts (Knierim and Van Essen 
1992, Kapadia et al, 1995), but is also desirable 
because regions and their boundary contours 
are complementary. The qualitative structure 
of the connection pattern satisfying the condi- 
tions is shown in Fig. 2B, and is thus a predic- 
tion of our model (see Appendix and Li (1998) 
for its derivation). Qualitatively, the connec- 
tion pattern resembles a "bow tie": J predom- 
inantly links cells with aligned RFs for contour 
enhancement, and W predominantly links cells 
with non-aligned RFs for surround suppression. 
Both J and W link cells with similar orientation 
preferences, as observed experimentally (Rock- 
land and Lund 1983, Gilbert and Wiesel 1983, 
Hirsch and Gilbert 1991, Weliky et al, 1995), 
and their magnitudes decay with distance between
RFs (Li, 1998).
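
The qualitative properties listed above (locality, translation and rotation invariance, $J$ along the bar's axis, $W$ on its flanks, similar orientations preferred, decay with distance) can be captured by a toy kernel. The function below is only an illustration of those properties and is not the connection formula of Li (1998): the 45 degree split between "aligned" and "flanking" neighbors, the exponential decay, the cut-off distance, and all names are assumptions.

```python
import math


def _fold(angle):
    """Map an angle difference to [0, pi/2], treating orientations as pi-periodic."""
    return abs((angle + math.pi / 2) % math.pi - math.pi / 2)


def connection(dx, dy, theta, theta_prime, max_dist=4.0, length_scale=2.0):
    """Return hypothetical weights (J, W) onto a bar of orientation theta from a
    bar of orientation theta_prime displaced by (dx, dy) in grid units.

    Only the displacement and the relative angles enter, so the pattern is the
    same for every cell in its own reference frame.
    """
    dist = math.hypot(dx, dy)
    if dist == 0.0 or dist > max_dist:
        return 0.0, 0.0  # no self-connection here; connections are local
    similarity = math.exp(-_fold(theta - theta_prime) / (math.pi / 8))
    strength = similarity * math.exp(-dist / length_scale)
    off_axis = _fold(math.atan2(dy, dx) - theta)
    if off_axis < math.pi / 4:
        return strength, 0.0  # neighbor roughly along the bar: excitation J
    return 0.0, strength      # neighbor on the flank: inhibition W


J_aligned, W_aligned = connection(2.0, 0.0, 0.0, 0.0)
J_flank, W_flank = connection(0.0, 2.0, 0.0, 0.0)
```

Plotting the sign of the returned weights around a horizontal bar gives excitation in two opposite wedges along the bar and inhibition elsewhere, which is the "bow tie" shape in its simplest form.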

Mean field techniques and dynamic stability 
analysis (shown in Appendix) are used to de- 
sign the horizontal connections that ensure the 
three conditions above. Conditions (1) and (2) are
strictly met only for (the particularly homogeneous)
inputs $I_{i\theta}$ within a region that are independent
of $i$, i.e., exactly the same inputs are
received at each grid point. When a region re- 
ceives more complex input texture patterns such 
as in stochastic or sparse texture regions (e.g., 
those in Fig. (6)), conditions (1) and (2) are 
often met but not guaranteed. This is not nec- 
essarily a flaw in this model, since it is not clear 
whether conditions (1) and (2) can always be 
met for any type of homogeneous input within
a region under the hardware constraints of the 
model or the cortex. This is consistent with 
the observations that sometimes a texture region 
does not pop out of a background pre-attentively 
in human vision (Bergen 1991). A range of 
quantitatively different connection patterns can 
meet our three restrictive conditions. Of course, this
range depends on the particular structure and 
parameters of the model such as its receptive 
field sampling density. This makes our model
quantitatively imprecise compared to physiological
and psychophysical observations (see discussions
later).

4. Performance of the model 

The model was applied to a variety of input 
textures, as shown in examples in the figures. 
With two exceptions, the input value $I_{i\theta}$ is the
same for all visible bars in each example so that
any differences in the outputs $g_x(x_{i\theta})$ of the bars
are solely due to the effects of the intra-cortical
interactions. The exceptions are the input taken 
from a photo (Fig. 10), and the input in Fig. 
(9D) which models an experiment on contour 
enhancement (Kapadia et al 1995). The differences
in the outputs, which are interpreted as
differences in saliencies, are significant about one
membrane time constant after the initial neu- 
ral response (Li, 1998). This agrees with ex- 
perimental observations (Knierim and van Essen 
1992, Kapadia et al 1995, Gallant et al 1995) if 
this time constant is assumed to be of order 10 
msec. The actual values of $I_{i\theta}$ used in all examples
are chosen to mimic the corresponding experimental
conditions. In this model the dynamic
range is $I_{i\theta} = (1.0, 4.0)$ for an isolated bar to
drive the excitatory neuron from threshold activation
to saturation. Hence, we use $I_{i\theta} = 1.2$,
2.0, and 3.5 for low, intermediate, and high contrast
input conditions used in experiments. Low
input levels are used to demonstrate contour enhancement:
the visible bars in Fig. (7B) and
the target bar in Fig. (9D) (Kapadia et al 1995, 
Kovacs and Julesz 1993). Intermediate levels 
are used for all visible bars in texture segmenta- 
tion and figure-ground pop-out examples (Figs. 
(3, 4, 5, 6, 7A, and 8)). High input levels are 
used for all visible bars in Fig. (9A,B,C) and 
the contextual (background) bars in Fig. (9D) 
to model the high contrast conditions used in 
physiological experiments that study contextual 
influence from textured and/or contour back- 
grounds (Knierim and van Essen 1992, Kapadia 
et al 1995). The input $I_{i\theta}$ from a photo image
(Fig. (10)) is different for different $i\theta$ with
$I_{i\theta} < 3.0$. The output saliency $g_x(x_{i\theta})$ ranges in
[0, 1]. The widths of the bars in the figures are
proportional to input or output strengths. The 
same model parameters (e.g., the dependence of 
the synaptic weights on distances and orientations,
the thresholds and gains in the functions
$g_x(\cdot)$ and $g_y(\cdot)$, and the level of input noise in
$I_o$) are used for all the examples whether it is for
the texture segmentation, contour enhancement, 
figure-ground segregation, or combinations of 
them. The only differences between examples
are the model inputs
$I_{i\theta}$ and possibly the image grid structure
(Manhattan or orthogonal grids) for better
input sampling. All the model parameters
needed to reproduce the results are listed in the 
Appendix of the reference (Li, 1998). 

Fig. 3A shows a sample input containing 
two regions. Fig. 3B shows the model out- 
put. Note that the plotted region is only a 
small part of, and extends continuously to, a 
larger image. This is the case for all figures 
in this paper except Fig. (10). Fig. 3C plots 
the average saliency S(c) of the bars in each 
column c in Fig. 3B, indicating that the most 
salient bars are indeed near the region boundary.
Fig. 3D confirms that the boundary can be
identified by thresholding the output activities.
The threshold, denoted thre (thre = 0.5
in Fig. 3D), is expressed as a fraction of the highest
output $\max_{i\theta}\{g_x(x_{i\theta})\}$ in the image. Note that V1
does not perform such thresholding; it is performed
only for display purposes. Also, the
value of the threshold is example dependent for
better visualization. To quantify the relative
saliency of the boundary, define the net saliency
at each grid point $i$ to be that of the most activated
bar ($\max_{\theta}\{g_x(x_{i\theta})\}$), let $S_{\mathrm{peak}}$ be the average
saliency across the most salient grid column
parallel and near the boundary, and $\bar{S}$ and $\sigma_s$ be
the mean and standard deviation of the saliencies
of non-boundary locations, defined as being
at least (say) 3 grid units away from the boundary.
Define $(r \equiv S_{\mathrm{peak}}/\bar{S},\ d \equiv (S_{\mathrm{peak}} - \bar{S})/\sigma_s)$.
A salient boundary should give large values for
$(r, d)$. One expects that at least one of $r$ and
$d$ should be comfortably larger than 1 for the
boundaries to be adequately salient. In Fig.
(3), $(r, d) = (4.5, 15.0)$. Note that the vertical
bars near the boundary are more salient than 
the horizontal ones. This is because the vertical 
bars run parallel to the boundary, and are there- 
fore specially enhanced through the contour en- 
hancement effect of the contextual influences. 
This is related to the psychophysical observation 
that texture boundaries are stronger when the 
texture elements on one side of them are parallel
to the boundaries (Wolfson and Landy 1994).
Fig. (4A) shows an example with the same orientation
contrast (90°) at the boundary but different
orientations of the texture bars. Here the
saliency values distribute symmetrically across 
the boundary and the boundary strength is a lit- 
tle weaker. These model behaviors can be phys- 
iologically tested. 
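
The thresholded display and the boundary measures $(r, d)$ defined above can be computed from a grid of net saliencies as sketched below. Some details are assumptions, since the text does not fix them: the boundary is taken to be a vertical grid column, "near the boundary" is read as within one column of it, the population standard deviation is used, and the function names and the small example grid are hypothetical.

```python
import statistics


def threshold_mask(net_saliency, thre=0.5):
    """Mark grid points whose saliency is at least `thre` times the image maximum."""
    top = max(max(row) for row in net_saliency)
    return [[value >= thre * top for value in row] for row in net_saliency]


def boundary_measures(net_saliency, boundary_col, margin=3):
    """Return (r, d) for a vertical boundary at column `boundary_col`.

    net_saliency[row][col] is the response of the most activated bar at that
    grid point.
    """
    n_rows = len(net_saliency)
    n_cols = len(net_saliency[0])
    column_mean = [
        sum(row[c] for row in net_saliency) / n_rows for c in range(n_cols)
    ]
    near = range(max(0, boundary_col - 1), min(n_cols, boundary_col + 2))
    s_peak = max(column_mean[c] for c in near)
    background = [
        value
        for row in net_saliency
        for c, value in enumerate(row)
        if abs(c - boundary_col) >= margin
    ]
    s_mean = statistics.mean(background)
    sigma_s = statistics.pstdev(background)
    return s_peak / s_mean, (s_peak - s_mean) / sigma_s


# Hypothetical 2 x 10 grid of net saliencies with a boundary at column 5.
grid = [[0.1, 0.3, 0.1, 0.3, 0.5, 0.9, 0.5, 0.3, 0.1, 0.3]] * 2
r, d = boundary_measures(grid, boundary_col=5)
mask = threshold_mask(grid, thre=0.5)
```

Note that `sigma_s` is zero for a perfectly uniform background, in which case $d$ is undefined; a real analysis would need to handle that case explicitly.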

Fig. 4 shows examples using other ori- 
entations of the texture bars. The bound- 
ary strength decreases with decreasing orienta- 
tion contrast at the region border. It is very 
weak when the orientation contrast is only 15° 
(Fig. (4C)); here translation invariance in the input
is only weakly broken, making the boundary
very difficult to detect pre-attentively. Note
also that the most salient location in an image
may not be exactly on the boundary (Fig.
4B). This should lead to a bias in the estimation
of the border location, as can be experimentally
tested. This also suggests that outputs
from pre-attentive segmentation need to be pro- 
cessed further by the visual system. The bound- 
ary strength also decreases if the orientations of 
the texture elements are somewhat random or 
the spacing between the elements increases (Fig. 
(5)). Boundary detection is difficult when the orientation
noise exceeds 30° or when the spacing between
bar elements is more than 4 or 5 grid
points (or texture element sizes). These qual- 
itative and quantitative results (on the cut off 
orientation contrast, orientation noise, and bar 
spacings) compare quite well with human per- 
formance on segmentation related tasks (Noth- 
durft 1985, 1991). 

This model also copes well with textures defined
by complex or stochastic patterns (Fig. (6)).
In both Figs. 6A and 6B, the neighboring
regions can be segmented even though they
have the same bar primitives and densities. In 
particular, the two regions in Fig. 6A have exactly
the same features, just like those in Fig. 1,
and would be difficult to segment using tradi- 
tional approaches. 

When a region is very small, all parts of it be- 
long to the boundary and it pops out from the 
background, as in Fig. 7A. In addition, Fig. 7B 
confirms that exactly the same model, with the 
same elements and parameters, can also high- 
light contours against a noisy background — 
another example of a breakdown of translation 
invariance. 

Our model also accounts for the asymmetry 
in pop-out strength observed in psychophysics 
(Treisman and Gormican, 1988), i.e., item A 
pops out among item B more easily than vice 
versa. Fig. (8) demonstrates such an example 
where a cross among bars pops out much more 
readily than a bar among crosses. Such asym- 
metry is quite natural in our framework — the 
nature of breakdown of translation invariance in 
the input is quite different depending on which 
one is the figure or background. 

The model replicates the results of physiolog- 
ical experiments on contextual influences from 
beyond the classical receptive fields (Knierim 
and van Essen 1992, Kapadia et al, 1995). In 
particular, Figs. (9A,B,C,D) demonstrate that
the response of a neuron to a bar of preferred 
orientation in its receptive field is suppressed by 
a textured surround but enhanced by colinear 
contextual bars that form a line. As experimen- 
tally observed (Knierim and van Essen 1992), 
suppression in the model is strongest when the 
surround bars are of the same orientation as 
the center bar, is weaker when the surround 
bars have random orientations, and is weakest 
when the surround bars are oriented orthogo- 
nally to the center bar. The relative degree of 
suppression is quantitatively comparable to that 
of the orientation contrast cells observed physiologically
(Knierim and van Essen 1992). Similarly,
Fig. (9D) closely simulates the enhancement
effect observed physiologically (Kapadia et
al 1995) when bars in the surround are aligned 
with the central bar to form a line. 

5. Summary and discussions 

Summary of the results 

This paper makes two main contributions. 
First, we propose a computational framework 
for pre-attentive segmentation — segmentation 
without classification. Second, we present a biologically
based model of V1 which implements
the framework using contextual influences, and 
we thereby demonstrate the feasibility of the 
framework. 

Since it does not rely on classification, our seg- 
mentation framework is simpler than traditional 
methods, which explicitly or implicitly require 
classification. Consequently, not only can our 
framework be implemented using lower level visual
mechanisms as in V1, but it also avoids the
dilemma which plagues the traditional compu- 
tational approaches — segmentation presumes 
classification, classification presumes segmentation.
A further consequence is that our framework
can easily handle some segmentation examples,
such as those in Fig. (6A), in which the
two regions have the same classification values.
Such examples pose problems for the traditional
computational approaches but are easily segmentable
by human pre-attentive vision.

Since the computational framework is new, 
this is the first model of V1 that captures the
effect of higher neural activities near region 
boundaries, as well as its natural consequence 
of pop-out of small figures against backgrounds 
and asymmetries in pop-out strengths between 
choices of figure and ground. The mechanism of 
the model is the local intra-cortical interactions 
that modify individual neural activities depend- 
ing on the contextual visual stimuli, thus de- 
tecting the region boundaries by detecting the 
breakdown of translation invariance in inputs. 
Furthermore, our model is the first to use the 
same neural circuit for both the region boundary
effect and contour enhancement: individual
contours in a noisy or non-noisy background
can also be seen as examples of the breakdown of
translation invariance in inputs. Putting these
effects together, V1 is modeled as a saliency
network that highlights the conspicuous image 
areas in inputs. These conspicuous areas in- 
clude region boundaries, and smooth contours or 
small figures against backgrounds, thus serving 
the purpose of pre-attentive segmentation. This 
V1 model, with its intra-cortical interactions designed
for pre-attentive segmentation, successfully
explains the contextual influences beyond
the classical receptive fields observed in phys- 
iological experiments (Knierim and van Essen 
1992, Kapadia et al 1995). Hence, we suggest 
that one of the roles of contextual influences is 
pre-attentive segmentation. 

Relation to other studies 

It has recently been argued that texture anal- 
ysis is performed at low levels of visual process- 
ing (Bergen, 1991) — indeed filter based models 
(Bergen and Adelson 1988) and their non-linear 
extensions (e.g., Malik and Perona (1990)) cap- 
ture well much of the phenomenology of psy- 
chophysical performance. However, all the pre- 
vious models are in the traditional framework 
of segmentation by classification, and thus differ 
from our model in principle. For example, the 
texture segmentation model of Malik and Per- 
ona (1990) also employs neural-like (albeit much 
less realistic) interactions in a parallel network. 
However, their interactions are designed to clas- 
sify or extract region features. Consequently, 
the model requires a subsequent feature compar- 
ison operation (by spatial differentiation) in or- 
der to segment. It would thus have difficulties in 
cases like Fig. (1), and would not naturally cap- 
ture figure pop-out, asymmetries between the 
figure and ground, or contour enhancement. 

By locating the conspicuous image locations 
without specific tuning to (or classification of) 
any region features, our model goes beyond early
visual processing using center-surround filters or 
the like (Marr, 1982). While the early stage 
filters code image primitives (Marr, 1982), our 
mechanism should help in object surface representation.
Since they collect contextual influences
over a neighborhood, the neurons naturally
account for the statistical nature of the
local image characteristics that define regions. 
This agrees with Julesz's conjecture of segmen- 
tation by image statistics (Julesz, 1962) without 
any restriction to being sensitive only to the first 
and second order image statistics. Julesz's con- 
cept of textons (Julesz, 1981) could be viewed in 
this framework as any feature to which the par- 
ticular intra-cortical interactions are sensitive 
and discriminatory. Using orientation depen- 
dent interactions between neurons, our model 
agrees with previous ideas (Nothdurft, 1994)
that (texture) segmentation is primarily driven
by orientation contrast. However, the emergent
network behavior is collective and accommo- 
dates characteristics of general regions beyond 
elementary orientations, as in Fig. 6. Furthermore,
the psychophysical phenomenon of filling-in
(when one fails to notice a small blank region
within a textured region) could be viewed in our
framework as an instance in which the network
fails to sufficiently highlight the non-homogeneity
in inputs near the filled-in area.

Our pre-attentive segmentation without clas- 
sification is quite primitive. It merely segments 
surface regions from each other, whether or not 
these regions belong to different visual objects. 
Furthermore, by not classifying, it does not 
characterize the region properties (such as by 
the 2+1/2 dimensional surface representations 
(Marr 1982)) more than what is already implic- 
itly present in the raw image pixels or the cell 
responses in V1. Hence, for example, our model
does not say whether a region is made of a trans- 
parent surface on top of another surface. 

Our framework of segmentation without classification
suggests that one should find experimental
evidence of pre-attentive segmentation
preceding and dissociated from visual classification/discrimination.
Recent experimental evidence
from V1 (Lamme, Zipser, and Spekreijse
1997, Zipser, private communication 1998)
shows that the modulation of neural activities 
starts at the texture boundary and only later includes
the figure surface, where the neural modulations
take about 50 ms to develop after initial
cell responses (Zipser et al 1996, Zipser, private
communication, 1998). Some psychophysical
evidence (Scialfa and Joffe 1995) suggests
that information regarding (figure) target pres- 
ence is available before information regarding 
feature values of the targets. V2 lesions in
monkeys have been shown to disrupt region content
discrimination but not region border detection 
(Merigan, Mealey, and Maunsell, 1993). These 
results are consistent with our suggestion. Furthermore,
neural modulation in V1, especially
that in the figure surface (Zipser 1998, private
communication), is strongly reduced or abolished
by anaesthesia or lesions in higher visual
areas (Lamme et al 1997), while experiments by 
Gallant et al (1995) show that activity modula- 
tion at texture boundaries is present even under 
anaesthesia. Taken together, this experimental
evidence suggests the plausibility of the following
computational framework. Pre-attentive
segmentation without classification in V1 precedes
region classification. Region classification
after pre-attentive segmentation is initialized in
higher visual areas. The classification is then fed
back to V1 to give top-down influence and refine
the segmentation (perhaps to remove the bias
in the estimation of the border location in the
example of Fig. 4B); this latter process might
be attentive and can be viewed as segmentation
by classification. The bottom-up and top-down
loop can be iterated to improve both classification
and segmentation. Top-down and bottom-up
streams of processing have been studied by
many others (e.g., Grenander 1976, Carpenter 
and Grossberg 1987, Ullman 1994, Dayan et al, 
1995). Our model is of the first step in the 
bottom-up stream, which initializes the iterative
loop. The neural circuit in our model can
easily accommodate top-down feedback signals
which, in addition to the V1 mechanisms, selectively
enhance or suppress the neural activities
in V1 (see examples in Li 1998). However, we
have not yet modeled how higher visual centers 
process the bottom up signals to generate the 
feedback. 

The model's components and behavior are 



based on and consistent with experimental evi- 
dence (Rockland and Lund, 1983, White, 1989, 
Douglas and Martin, 1990, Gilbert, 1992, Noth- 
durft, 1994, Gallant et al, 1995). The exper- 
imentally testable predictions of the model in- 
clude the qualitative structure of the horizontal 
connections as in Fig. 2B, the dependence of 
the boundary highlights on the relative orienta- 
tion between texture bars and texture borders 
(e.g., in Fig. 3B), and the biases in the esti- 
mated border location by the neural responses 
(e.g., Fig. 4B). Since the model is quite simplistic
in the connection design, we expect that there
will be significant differences between the model 
and physiological connections. For instance, 
two linked bars interact in the model either via 
monosynaptic excitation or disynaptic inhibi- 
tion. In real cortex, two linked cells could often 
interact via both excitation and inhibition, making
the overall strength of excitation or inhibition
dependent on input contrast (e.g., Hirsch and
Gilbert, 1991, see Li 1998 for analysis). Hence, 
the excitation (or inhibition) in our model could 
be interpreted as the abstraction of the pre- 
dominance of excitation (or inhibition) between 
two linked bars. Currently, different sources 
of experimental data on the connection struc- 
ture are not yet consistent with each other re- 
garding the spatial and orientation dependence 
of excitation and inhibition (Fitzpatrick 1996, 
Cavanaugh, Bair, Movshon 1997, Kapadia, pri- 
vate communication 1998, Hirsch and Gilbert 
1991, Polat, Mizobe, Pettet, Kasamatsu, Norcia 
1998), partly due to different experimental con- 
ditions like input contrast levels or the nature of 
stimulus elements (e.g., bars or gratings). Our 
model performance is also quantitatively depen- 
dent on input strength. One should bear this 
fact in mind when viewing the comparisons be- 
tween the model and experimental data in Figs. 
(4, 5, 9). 

The modulations of neural activity by cor- 
tical interactions should have perceptual con- 
sequences other than contour/region boundary 
enhancement and figure pop-out. For instance, 
the preferred orientation of the cells can shift 



depending on contextual bars. Under population
coding, this will lead to a tilt illusion, i.e.,
a change in the perceived orientation of the target
bar. The perceived orientation of the target
bar could shift away from or towards the orientation
of the contextual bars, depending on the spatial 
arrangement (and the orientations) of the con- 
textual bars. This is in contrast to the usual no- 
tion that the orientation of the target bar tends 
to shift away from those of the contextual bars. 
Both our model and a recent psychophysical
study (Kapadia, private communication, 1998)
confirm such context-dependent distortion in
perceived orientation. V1 cells indeed display
changes in orientation tuning under contextual
influences (Gilbert and Wiesel 1990), although 
the magnitude and direction of the changes vary 
from cell to cell. 
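
The population-coding argument above can be illustrated with a small numerical sketch. This is not the model of this paper: the Gaussian tuning curves, the population-vector read-out, and the multiplicative contextual gain (all names and parameter values below) are assumptions chosen only to show that the decoded orientation of a target can shift either away from or towards the contextual orientation, depending on whether the context suppresses or facilitates the cells tuned near it.

```python
import numpy as np

def decode_orientation(preferred_deg, responses):
    """Population-vector estimate of orientation, in degrees within [0, 180)."""
    # Orientation has period 180 degrees, so angles are doubled before averaging.
    angles = np.deg2rad(2.0 * np.asarray(preferred_deg, dtype=float))
    vec = np.sum(np.asarray(responses, dtype=float) * np.exp(1j * angles))
    return (np.rad2deg(np.angle(vec)) / 2.0) % 180.0

def tuning(preferred_deg, stimulus_deg, width_deg=20.0):
    """Hypothetical Gaussian tuning on the circular orientation difference."""
    d = (np.asarray(preferred_deg, dtype=float) - stimulus_deg + 90.0) % 180.0 - 90.0
    return np.exp(-0.5 * (d / width_deg) ** 2)

preferred = np.arange(0.0, 180.0, 5.0)   # preferred orientations of the population
target, context = 90.0, 75.0             # target bar and contextual bars (degrees)
base = tuning(preferred, target)         # responses to the target alone

# Assumed contextual modulation: a gain change centred on the context orientation.
gain_profile = tuning(preferred, context, width_deg=15.0)
suppressed = base * (1.0 - 0.5 * gain_profile)    # context suppresses nearby cells
facilitated = base * (1.0 + 0.5 * gain_profile)   # context facilitates nearby cells

theta_base = decode_orientation(preferred, base)
theta_suppressed = decode_orientation(preferred, suppressed)    # moves away from 75
theta_facilitated = decode_orientation(preferred, facilitated)  # moves towards 75
print(theta_base, theta_suppressed, theta_facilitated)
```

In this toy setting the sign of the shift is set entirely by the sign of the contextual gain, which in the model depends on the spatial arrangement and orientations of the contextual bars.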

Comparison with other models 

There are many other related models. Many 
cortical models are mainly concerned with con- 
tour linking, and the reference Li (1998) has 
a detailed citation of these models and com- 
parisons with our model. For instance, Gross- 
berg and his colleagues have developed models 
of visual cortex over many years (Grossberg and 
Mingolla 1985, Grossberg, Mingolla, and Ross, 
1997). They proposed their 'boundary contour 
system' as a model of intra-cortical and inter-areal
neural interactions in V1 and V2 and feedback
from V2 to V1. The model aims to capture
illusory contours which link line segments
and line endings, and presumably such linking 
affects segmentation. Other models are more 
concerned with regions, namely, to classify re- 
gion features and then to segment regions by 
comparing the classifications. To obtain tex- 
ture region features, Malik and Perona (1990) 
use local intra-cortical inhibition. Geman and 
Geman built a model based on Markov ran- 
dom fields to restore images, in which neigh- 
boring image features influence each other sta- 
tistically (Geman and Geman, 1984). Such lo- 
cal interactions improve the outcomes from the 
prior and preliminary feature classifications to 
drive segmentation. Recently, Lee (1995) used 



a Bayesian framework to infer the region fea- 
tures and boundary signals from initial image 
measurements using Gabor filters. The feature
and boundary values influence each other to update
their values in iterative steps to decrease
an energy functional derived from the Bayesian
framework. Lee (1995) suggested hypothetically
that a V1 circuit may implement this Bayesian
algorithm. 

Our model contrasts with previous models as the
only one that models the effect of region boundary
highlights in V1. Hence, it is also the only
one that models contour enhancement and region
boundary highlights in the same neural circuit.
Equally, its instantiation in V1 means that
our model does not perform operations such as
the classification and smoothing of region features
and the sharpening of boundaries as carried
out in some other models (e.g., Lee 1995,
Malik and Perona 1990). Although there are
many simulation and computational models of
V1, those not explicitly designed for the purpose
are unlikely to perform region boundary highlights
or contour enhancement. The reference
Li (1998) discussed the difficulties in a recurrent
network even for mere contour enhancement using
only the elements and operations in V1. Our
experience also shows that explicit design is necessary
for a V1 contour enhancement model to
additionally perform region boundary highlights
(i.e., to meet conditions (1) and (2) in section 3).

Limitations and extensions of the model 

Our model is still very primitive compared to
the true complexity of V1. We have yet to include
multiscale sampling or the over-complete
input sampling strategy adopted by V1, or to
include color, time, or stereo input dimensions. 
Also, the receptive field features used for our 
bar/edges should be determined more precisely. 
The details of the intra-cortical circuits within 
and between hypercolumns should also be better 
determined to match biological vision. 

Multiscale sampling is needed not only because
images contain multiscale features, but
also to model V1 responses to images from
flat surfaces slanted in depth: such a region
should also be seen as "homogeneous" or "translation
invariant" by V1, such that it has uniform
saliency. Merely replicating and scaling the current
model to multiple scales is not sufficient for
this purpose. The computation requires interactions
between different scales. We also have
yet to find a better sampling distribution even
within a single scale. Currently, the model neurons
within the same hypercolumn have exactly
the same RF centers and the RFs from different
hypercolumns barely overlap. This sampling
arrangement is sparse compared with V1 sampling.
Fig. (10) demonstrates the current model
performance on a photo. The effects of single
scale and sparse sampling (aliasing) are apparent
in the model input image, which is more
difficult than the photo image for humans to segment.
However, the most salient model outputs
do include the vertical column borders as well as
some of the more conspicuous horizontal streaks
in the photo.

In addition to orientation and spatial location,
neurons in V1 are tuned for motion direction/speed,
disparity, ocularity, scale, and color
(Hubel and Wiesel 1962, Livingstone and Hubel 
1984). Our model should be extended to stereo, 
time, and color dimensions. The horizontal con- 
nections in the extended model will link edge 
segments with compatible selectivities to scale, 
color, ocular dominance, disparity, and motion 
directions as well as orientations, as suggested 
by experimental data (e.g., Gilbert 1992, Ts'o 
and Gilbert 1988). The model should also expand
to include details such as on and off cells,
cells of different RF phases, non-orientation selective
cells, end stopped cells, and more cell layers.
These details should help achieve a better quantitative
match between the model and human
vision. For instance, Malik and Perona (1990)
showed using psychophysical observations that 
the odd-symmetric receptive fields are not used 
for pre-attentive segmentation. The design of 
the horizontal connections between cells should 
respect these facts. 

Any given neural interaction will be more sen- 
sitive to some region differences than others. 



Therefore, the model sometimes finds it easier
or more difficult to segment some regions than
natural vision does. Physiological and psychophysical
measurements of the boundary effect for different
types of textures can help to constrain the
connection patterns in an improved model. Ex- 
periments also suggest that the connections may 
be learnable or plastic (Karni and Sagi, 1991,
Sireteanu and Rieth 1991). It is desirable also 
to study the learning algorithms to develop the 
connections. 



We currently model saliency at each location 
quite coarsely by the activity of the most salient 
bar. It is mainly an experimental question as 
to how to best determine the saliency, and the 
model should accordingly be modified. This is 
particularly the case once the model includes 
multiple scales, non-orientation selective cells, 
and other visual input dimensions. The activi- 
ties from different channels should somehow be 
combined to determine the saliency at each lo- 
cation of the visual field. 
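
The current saliency rule (the activity of the most salient bar at each location) can be written in a few lines. The sketch below assumes the model outputs are stored in an array of shape (rows, columns, K), with one temporally averaged response per orientation; the array layout, the function name `saliency_map`, and the toy values are illustrative assumptions, not part of the model specification.

```python
import numpy as np

def saliency_map(outputs):
    """Return (saliency, best_orientation_index) for each grid location.

    outputs: array of shape (rows, cols, K) with the response of the
    excitatory cell for each of K orientations at each grid point.
    Saliency is taken as the response of the most active bar.
    """
    outputs = np.asarray(outputs, dtype=float)
    return outputs.max(axis=-1), outputs.argmax(axis=-1)

# Toy 2 x 3 grid with K = 4 orientations (values made up for illustration).
rng = np.random.default_rng(0)
toy = rng.uniform(0.0, 0.2, size=(2, 3, 4))
toy[0, 1, 2] = 0.9                      # one strongly responding bar
sal, best = saliency_map(toy)
print(sal)
print(best)
```

Other ways of combining channels (for example a sum or a weighted norm over orientations and scales) would replace only the `max` reduction; which rule is appropriate is, as noted above, an experimental question.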



In summary, this paper proposes a computational
framework for pre-attentive segmentation:
segmentation without classification. It introduces
a simple and biologically plausible model of
V1 to implement the framework using mechanisms
of contextual influences via intra-cortical
interactions. Although the model is still very
primitive compared to the real cortex, our results
show the feasibility of the underlying ideas:
that region segmentation can occur without region
classification, that breakdown of translation
invariance can be used to segment regions,
that region segmentation and contour detection
can be addressed by the same mechanism, and
that low-level processing in V1 together with local
contextual interactions can contribute significantly
to visual computations at global scales.



Appendix: Design analysis for horizontal connections

Connections J and W are designed to satisfy
the 3 conditions listed in section 3. To illustrate,
consider the example of a homogeneous input

$$I_{i\theta} = \begin{cases} \bar{I}, & \text{when } \theta = \bar{\theta} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

of a bar oriented $\bar{\theta}$ at every sampling point. By
symmetry, a mean field solution $(\bar{x}_{i\theta}, \bar{y}_{i\theta})$ is also
independent of spatial location $i$. For simplicity
assume $\bar{x}_{i\theta} = 0$ for $\theta \neq \bar{\theta}$, and ignore all $(x_{i\theta}, y_{i\theta})$
with $\theta \neq \bar{\theta}$. Perturbations $(x'_i = x_{i\bar{\theta}} - \bar{x}_{i\bar{\theta}},\ y'_i = y_{i\bar{\theta}} - \bar{y}_{i\bar{\theta}})$
around the mean field solution follow

$$\dot{Z} = A Z \qquad (4)$$

where $Z = (x', y')^T$. Matrix $A$ results from
expanding equations (1) and (2) around the
mean field solution; it contains the horizontal
connections $J_{i\bar{\theta},j\bar{\theta}}$ and $W_{i\bar{\theta},j\bar{\theta}}$ linking bar segments
all oriented at $\bar{\theta}$. Translation invariance
in J and W implies that every eigenvector of
$A$ is a cosine wave in space for both $x'$ and
$y'$. To ensure condition (1), either every eigenvalue
of $A$ should be negative so that no perturbation
from the homogeneous mean field solution
is self-sustaining, or the eigenvalue with
largest positive real part should correspond to
the zero frequency cosine wave in space, in which
case the deviation from the mean field solution
tends to be homogeneous although it will oscillate
over time (Li, 1998). Iso-orientation suppression
under supra-threshold input $\bar{I}$ is used
to satisfy condition (2). This requires that every
pyramidal cell $x_{i\bar{\theta}}$ in an iso-orientation surround
should receive stronger overall disynaptic inhibition
than monosynaptic excitation:

$$\alpha \sum_{j} W_{i\bar{\theta},j\bar{\theta}} > \sum_{j} J_{i\bar{\theta},j\bar{\theta}} \qquad (5)$$

where $\alpha = \psi(0)\, g'_y(\bar{y}_{i\bar{\theta}})$ comes from the inhibitory
interneurons. The excitatory cells near a region
boundary lack a complete iso-orientation surround;
they are less suppressed and so exhibit
stronger responses, meeting condition (2). We
tested conditions (1) and (2) in simulations using
these simple and other general input configurations,
including the cases when inputs within
a region are of the form $I_{i\theta} = I_{\theta}$ where $I_{\theta}$ is nonzero
for two orientation indices $\theta$. Condition (3)
is ensured by strong enough monosynaptic excitation
$\sum_{j\theta' \in \mathrm{contour}} J_{i\theta,j\theta'}$ along any smooth contour
extending from $i\theta$, and enough disynaptic
inhibition between local, similarly oriented, and
non-aligned bars to avoid enhancement of the
noisy background (details in Li 1998), within
the constraints of conditions (1) and (2).
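
The statement that translation invariance makes every eigenvector a cosine wave in space can be checked numerically on a toy example. The sketch below is a deliberately reduced illustration, not the matrix $A$ of equation (4): it uses a single ring of N sites with wrap-around boundaries (the model's visual space is also toroidal), and an assumed net coupling kernel `J - W` with a leak term on the diagonal. The kernel shapes and all numerical values are hypothetical.

```python
import numpy as np

N = 32
offsets = np.arange(N)
dist = np.minimum(offsets, N - offsets)      # wrap-around distance on the ring

# Hypothetical kernels: short-range excitation J, longer-range inhibition W.
J = np.where((dist > 0) & (dist <= 2), 0.2, 0.0)
W = np.where((dist > 0) & (dist <= 5), 0.15, 0.0)
kernel = J - W                               # assumed net linearized coupling
kernel[0] = -1.0                             # leak term on the diagonal

# Translation invariance: M[i, j] depends only on (i - j) mod N (a circulant matrix).
idx = (offsets[:, None] - offsets[None, :]) % N
M = kernel[idx]

# For a symmetric kernel the eigenvalues are sum_d k(d) cos(2*pi*m*d/N),
# which is the real part of the discrete Fourier transform of the kernel.
eig_fourier = np.fft.fft(kernel).real

# A spatial cosine wave of frequency m is an eigenvector of M.
m = 3
wave = np.cos(2.0 * np.pi * m * offsets / N)
residual = np.max(np.abs(M @ wave - eig_fourier[m] * wave))

print("max eigenvalue:", eig_fourier.max())
print("frequency index of max eigenvalue:", int(np.argmax(eig_fourier)))
print("eigenvector residual:", residual)
```

With these made-up kernels every eigenvalue is negative, so no spatially periodic perturbation of the homogeneous state grows, which corresponds to the first of the two alternatives given for condition (1). Changing the kernels so that a nonzero-frequency mode has a positive eigenvalue would produce the spontaneous pattern formation that the connection design is meant to avoid.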
[Figure 2 panel labels. A: Visual space, edge detectors, and their interactions (a sampling location; one of the edge detectors). B: Neural connection pattern (solid: J, dashed: W). C: Model neural elements (edge outputs to higher visual areas; an interconnected neuron pair for edge segment $i\theta$; inputs $I_c$ to inhibitory cells; visual inputs, filtered through the receptive fields, to the excitatory cells).]

[Figure 1 panel labels: Region 1, Region 2.]

Figure 1: The two regions have the same feature val- 
ues. Traditional approaches to segmentation using feature 
extraction and comparison have difficulty in segmenting the 
regions. 



Figure 2: A: Visual inputs are sampled in a discrete 
grid by edge/bar detectors, modeling RFs for V1 layer 2-3
cells. Each grid point has K neuron pairs (see C), one per 
bar segment. All cells at a grid point share the same RF 
center, but are tuned to different orientations spanning 180°, 
thus modeling a hypercolumn. A bar segment in one hyper- 
column can interact with another in a different hypercolumn 
via monosynaptic excitation J (the solid arrow from one thick 
bar to another), or disynaptic inhibition W (the dashed ar- 
row to a thick dashed bar). See also C. B: A schematic of the 
neural connection pattern from the center (thick solid) bar 
to neighboring bars within a finite distance (a few RF sizes). 
J's contacts are shown by thin solid bars. W's are shown by 
thin dashed bars. All bars have the same connection pattern, 
suitably translated and rotated from this one. C: An input 
bar segment is associated with an interconnected pair of excitatory
and inhibitory cells; each model cell abstractly models
a local group of cells of the same type. The excitatory cell
receives visual input and sends output $g_x(x_{i\theta})$ to higher centers.
The inhibitory cell is an interneuron. The visual space
has toroidal (wrap-around) boundary conditions. 
[Figure 3 panel titles. A: Input image ($I_{i\theta}$) to model. B: Model output. C: Neural response levels vs. columns above. D: Thresholded model output.]

Figure 3: An example of the segmentation performance of the model. A: Input $I_{i\theta}$ consists of two regions; each visible bar
has the same input strength. B: Model output for A, showing non-uniform output strengths (temporal averages of $g_x(x_{i\theta})$)
for the edges. The input and output strengths are proportional to the bar widths. C: Average output strengths (saliencies)
vs. lateral locations of the columns in B, with the bar lengths proportional to the corresponding edge output strengths. D:
The thresholded output from B for illustration, thre = 0.5. Boundary saliency measures (r, d) = (4.5, 15.0).
Figure 4: A, B, C: Additional examples of model segmentation. Each is an input image as in Fig. 3A followed immediately
below by the corresponding thresholded (strongest) model outputs as in Fig. 3D. In A, B, C respectively, the boundary
measures are: (r,d) = (1.4, 9.0), (r,d) = (1.77, 12.2), (r,d) = (1.05, 1.24), and thre = 0.77, 0.902, 0.8775 to obtain the output
highlights. D: Plots of boundary strengths (r,d) (symbols '+' and 'o' respectively) vs. orientation contrast at boundaries. A
data point for each given orientation contrast is the average of 2 or 3 examples of different texture bar orientations. Again,
each plotted region is only a small part of a larger extended image. Note that the most salient column in B is not exactly
on the boundary, though the boundary column (on its left) is only 6% less salient numerically, and about 70% more salient than
areas away from the boundary. Also, C contains two regions whose bar elements differ only slightly in orientation, giving a
perceptually weak vertical boundary in the middle. Because of the noise in the system, the saliencies of the bars in the same
column in A, B, C are not exactly the same; this is also the case in other figures.



[Figure 5 panel titles. A: orientation jitter 15° (input; model output). B: orientation jitter 30° (input; model output). C: orientation jitter 45° (input; model output). D: bar spacing = 2 (input; output highlights). E: bar spacing = 4 (input; output highlights). F: Boundary strength vs. bar spacing.]
Figure 5: The boundary strength changes with orientation noise and the spacings between the bars in the textures. A,
B, C: Model inputs ($I_{i\theta}$) and outputs ($g_x(x_{i\theta})$) for two texture regions made of bars oriented, on average,
horizontally and vertically, respectively. Each bar's orientation is randomly jittered from the average orientation by up to 15°, 30°, and
45°, respectively. The orientation noise makes the saliency values quite non-uniform near the boundary, making the boundary
measures (r, d) less meaningful. Boundary detection is difficult or impossible with orientation jitter > 30°. D, E: Model
inputs ($I_{i\theta}$) and output ($g_x(x_{i\theta})$) highlights for two texture regions made of bars oriented horizontally and vertically. The
spacings between neighboring bars are 2 and 4 grid spacings, respectively. F: Plots of boundary strengths (r, d) (symbols '+'
and 'o' respectively) vs. bar spacings for stimuli like D, E. To obtain output highlights in D, E respectively, thre = 0.92, 0.95.
Note that although the boundary saliency is only a fraction higher than the non-boundary saliency as bar spacing increases,
the boundary is still the most salient output when the region features are not noisy. The line widths for model outputs are
plotted with one scale for A, B, C and another for D, E.
Figure 6: A, B, C: Model performance on regions with complex texture elements, and D: regions with stochastic texture
elements. Each plot is the model input ($I_{i\theta}$) followed immediately below by the output ($g_x(x_{i\theta})$) highlights. For A, B, C, D
respectively, the boundary measures r and d are (r, d) = (1.14, 6.3), (r, d) = (1.1, 2.0), (r, d) = (1.5, 4.5), and (r, d) = (2.56, 5.6), and
the thresholds used to generate the output highlights are thre = 0.91, 0.9, 0.85, 0.56.
[Figure panel labels: Input ($I_{i\theta}$) and Output highlights, each appearing twice.]



Figure 7: Model behavior for other types of inputs. A: A small region pops out since all parts of it belong to the boundary.
The figure saliency is 0.336, which is 2.42 times the average ground saliency. B: Exactly the same model circuit (and
parameters) performs contour enhancement. The input strength is $I_{i\theta} = 1.2$. The contour segments' saliencies are 0.42 ± 0.03,
and the background elements' saliencies are 0.18 ± 0.08. To obtain the output highlights in A, B respectively, thre = 0.46, 0.73.
Figure 8: Asymmetry in pop-out strength. A: The cross is 3.4 times as salient (measured as the saliency of the horizontal
bar in the cross) as the average background. B: The area near the central vertical bar is the most salient part in the image,
and is no more than 1.2 times as salient as the average background. The target bar itself is actually a bit less salient than the
average background.
Figure 9: Model behavior under inputs resembling those in physiological experiments. The input stimuli are composed of
a vertical (target) bar at the center surrounded by various contextual stimuli. All the visible bars have high contrast input
$I_{i\theta} = 3.5$ except for the target bar in D where $I_{i\theta} = 1.2$ is near threshold. A, B, C simulate the experiments of Knierim and
van Essen (1992) where a stimulus bar is surrounded by contextual textures of bars oriented parallel, orthogonal, or randomly
to it, respectively. The saliencies of the (center) target bars in A, B, C are, respectively, 0.23, 0.74, and 0.41 (averaged
over different random surrounds). An isolated bar of the same input strength would have a saliency 0.98. D simulates the
experiment by Kapadia et al (1995) where a low contrast (center) target bar is aligned with some high contrast contextual
bars to form a line in a background of randomly oriented high contrast bars. The target bar saliency is 0.39, about twice as
salient as an isolated bar at the same (low) input strength, and roughly as salient as a typical (high input strength) background
bar. Contour enhancement also holds in D when all bars have high input values, simulating the psychophysics experiment by
Field, Hayes, and Hess (1993).
Figure 10: Model behavior on a photo image. The input to the model is modeled as $I_{i\theta} = (e^2 + o^2)^{1/4}$, where e and o
are the outputs from the even and odd Gabor-like filters at grid sampling point i with preferred orientation $\theta$; the power 1/4
coarsely models some degree of contrast gain control. At each grid point, bars of almost all orientations have nonzero input
values $I_{i\theta}$. For display clarity, no more than 2 strongest input or output orientations are plotted at each grid point in model
input and output above. The second orientation bar is plotted only if the input or output values at the grid point are not uni-modal,
and the second strongest mode is at least 30% of the strength of the strongest one. The strongest $I_{i\theta} = 3.0$ in the whole input.
The more salient locations in the model output include some vertical borders of the columns in the input texture, as well as
horizontal streaks, which are often also conspicuous in the original image. Note that this photo is sampled against a blank
background on the left and right, hence the left and right sides of the photo area are also highlighted.
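
The input transformation and the display rule described in the Figure 10 caption can be sketched as follows. The function names, the array layout, and in particular the reading of "not uni-modal" as "has more than one local peak along the circular orientation axis" are assumptions made for illustration; the filters that produce e and o are not specified here and are not reproduced.

```python
import numpy as np

def model_input(even, odd):
    """Model input I = (e^2 + o^2)^(1/4), applied elementwise."""
    even = np.asarray(even, dtype=float)
    odd = np.asarray(odd, dtype=float)
    return (even ** 2 + odd ** 2) ** 0.25

def orientations_to_plot(values, min_ratio=0.3):
    """Pick at most two orientation indices to display at one grid point.

    values: 1-D responses over K orientations (treated as a circular axis).
    The strongest peak is always returned; a second peak is returned only
    if it reaches at least `min_ratio` of the strongest one.
    """
    v = np.asarray(values, dtype=float)
    if v.size == 0 or v.max() <= 0:
        return []
    # Local peaks on the circular orientation axis (assumed definition of a mode).
    is_peak = (v > np.roll(v, 1)) & (v >= np.roll(v, -1))
    peaks = np.flatnonzero(is_peak)
    if peaks.size == 0:                      # flat profile: fall back to argmax
        return [int(np.argmax(v))]
    peaks = peaks[np.argsort(v[peaks])[::-1]]
    chosen = [int(peaks[0])]
    if peaks.size > 1 and v[peaks[1]] >= min_ratio * v[peaks[0]]:
        chosen.append(int(peaks[1]))
    return chosen

I_example = model_input(3.0, 4.0)            # (9 + 16)^(1/4) = sqrt(5)
two_modes = orientations_to_plot([0.1, 1.0, 0.2, 0.05, 0.5, 0.1])
one_mode = orientations_to_plot([0.1, 1.0, 0.2, 0.05, 0.2, 0.1])
print(I_example, two_modes, one_mode)
```

Because of the 1/4 power, doubling both filter outputs raises the model input only by a factor of $\sqrt{2}$, which is the compressive behavior the caption attributes to contrast gain control.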